
[Fix][RayJob SidecarMode] Prevent premature job termination during transient head node spikes #4399

Open

justinyeh1995 wants to merge 9 commits into ray-project:master from justinyeh1995:fix/4285-rayjob-sidecarmode-terminal-condition

Conversation

justinyeh1995 (Contributor) commented Jan 15, 2026

Why are these changes needed?

The submitter sidecar may exit during transient head-node CPU spikes, which currently causes the operator to mark the RayJob as failed even though the Ray job itself is still running.

The fix has two layers:

| Level | What it does | K8s version | KubeRay feature gate |
| --- | --- | --- | --- |
| 1 | Consults the Ray dashboard for the job's actual status before marking the RayJob as failed | All | Always on |
| 2 | Enables per-container restart rules for the submitter container so non-zero exits restart the container and re-attach the log | 1.34+ (with ContainerRestartRules enabled on the cluster) | SidecarSubmitterRestart |

If SidecarSubmitterRestart is enabled on a cluster running K8s < 1.34, the operator fails fast at startup.
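For orientation (not the PR's exact code), a minimal sketch of both layers in Go. The Level 2 rule uses the K8s 1.34 alpha ContainerRestartRules types that appear in the diff; the Level 1 check is written against a hypothetical `dashboardClient` stand-in rather than KubeRay's real dashboard client, and the function names are illustrative:

```go
package sidecar

import (
	"context"

	corev1 "k8s.io/api/core/v1"
)

// Level 2 sketch: a rule that restarts the submitter container on any non-zero
// exit code, so it can re-attach to the job logs instead of failing the RayJob.
// The rule would go on the submitter container's restartPolicyRules field
// (K8s 1.34+ alpha ContainerRestartRules).
func submitterRestartRules() []corev1.ContainerRestartRule {
	return []corev1.ContainerRestartRule{{
		Action: corev1.ContainerRestartRuleActionRestart,
		ExitCodes: &corev1.ContainerRestartRuleOnExitCodes{
			Operator: corev1.ContainerRestartRuleOnExitCodesOpNotIn,
			Values:   []int32{0}, // any non-zero exit triggers a restart
		},
	}}
}

// Level 1 sketch: before treating a dead submitter as terminal, ask the Ray
// dashboard what the job is actually doing. dashboardClient is a stand-in
// interface, not KubeRay's actual dashboard HTTP client.
type dashboardClient interface {
	GetJobStatus(ctx context.Context, jobID string) (string, error)
}

func shouldMarkFailed(ctx context.Context, dc dashboardClient, jobID string) bool {
	status, err := dc.GetJobStatus(ctx, jobID)
	if err != nil {
		// Dashboard unreachable (e.g. a head-node CPU spike): do not fail the
		// RayJob yet; let the next reconcile retry.
		return false
	}
	// Only terminal Ray job states justify marking the RayJob as failed.
	return status == "FAILED" || status == "STOPPED"
}
```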

Related issue number

Closes #4285

Testing

Level 1: kind v1.26 (less than 1.34) / no feature gate enabled

  1. Create a kind cluster with node image v1.26 and follow the guide to build the operator image and load it into the kind cluster:
kind create cluster --name test-cluster --image kindest/node:v1.26.0

cd ray-operator
IMG=kuberay/operator:nightly make docker-build

kind load docker-image kuberay/operator:nightly --name test-cluster
  2. Apply the following manifest, a slightly modified version of ray-job.sidecar-mode.yaml:
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sidecar-mode
spec:
  # In SidecarMode, the KubeRay operator injects a container into the Ray head Pod to submit the Ray job and tail logs.
  # This will avoid inter-Pod communication, which may cause network issues. For example, some users face WebSocket hangs.
  # For more details, see https://github.com/ray-project/kuberay/issues/3928#issuecomment-3187164736.
  submissionMode: "SidecarMode"
  entrypoint: python /home/ray/samples/sample_code.py
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
      - pendulum==2.1.2
    env_vars:
      counter_name: "test_counter"

  rayClusterSpec:
    rayVersion: '2.52.0'
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.52.0
            ports:
            - containerPort: 6379
              name: gcs-server
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            resources:
              limits:
                cpu: "1"
                memory: "5Gi"
              requests:
                cpu: "1"
                memory: "2Gi"
            volumeMounts:
            - mountPath: /home/ray/samples
              name: code-sample
          volumes:
          - name: code-sample
            configMap:
              name: ray-job-code-sample
              items:
              - key: sample_code.py
                path: sample_code.py
    workerGroupSpecs:
    - replicas: 1
      minReplicas: 1
      maxReplicas: 5
      groupName: small-group
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray:2.52.0
            resources:
              limits:
                cpu: "1"
                memory: "1Gi"
              requests:
                cpu: "1"
                memory: "1Gi"

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  sample_code.py: |
    import ray
    import os
    import requests
    import time

    ray.init()

    @ray.remote
    class Counter:
        def __init__(self):
            # Used to verify runtimeEnv
            self.name = os.getenv("counter_name")
            assert self.name == "test_counter"
            self.counter = 0

        def inc(self):
            self.counter += 1

        def get_counter(self):
            return "{} got {}".format(self.name, self.counter)

    counter = Counter.remote()

    for _ in range(5):
        ray.get(counter.inc.remote())
        print(ray.get(counter.get_counter.remote()))

    # Verify that the correct runtime env was used for the job.
    assert requests.__version__ == "2.26.0"
    
    # keep job alive long enough to kill submitter mid-run
    print("Entering long-running phase (5 minutes)...")
    for i in range(300):
        ray.get(counter.inc.remote())
        if i % 10 == 0:
            print(f"tick={i}, {ray.get(counter.get_counter.remote())}")
        time.sleep(1)

    print("Done.")
  3. Watch the RayJob status until it becomes running:
kubectl get rayjob rayjob-sidecar-mode -w
  4. Disrupt the sidecar container:
CLUSTER=$(kubectl get rayjob rayjob-sidecar-mode -o jsonpath='{.status.rayClusterName}')
HEAD_POD=$(kubectl get pods -l ray.io/cluster=$CLUSTER,ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}')

# we cannot kill pid 1 within the container 
# e.g. kubectl exec $HEAD_POD -c ray-job-submitter -- sh -c 'pkill -f "ray job log"'
# so we stop the container instead

CONTAINER_ID=$(kubectl get pod $HEAD_POD -o jsonpath='{.status.containerStatuses[?(@.name=="ray-job-submitter")].containerID}' | sed 's|containerd://||')
docker exec -it test-cluster-control-plane crictl stop $CONTAINER_ID
  5. Verify the RayJob is still running:
kubectl get rayjob rayjob-sidecar-mode -o jsonpath='{.status.jobDeploymentStatus}'
(Screen recording attached: Screen.Recording.2026-02-04.at.9.16.01.PM.-.Compressed.with.FlexClip.mp4)

kind v1.26 (less than 1.34) / feature gate enabled

  1. Create a kind v1.26 cluster and load the operator image into it:
kind create cluster --name test-cluster --image kindest/node:v1.26.0
kind load docker-image kuberay/operator:nightly --name test-cluster
  2. Install the operator with the feature gate enabled:
helm upgrade --install kuberay-operator ../helm-chart/kuberay-operator \
  --set image.repository=kuberay/operator \
  --set image.tag=nightly \
  --set featureGates\[0\].name=SidecarSubmitterRestart \
  --set featureGates\[0\].enabled=true
k get pods
# It is expected that the operator exits with CrashLoopBackOff

k logs kuberay-operator-58f4998f5d-2jc6k
# and the logs mention SidecarSubmitterRestart feature gate requires K8s 1.34+
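For context (again illustrative rather than the PR's exact code), the fail-fast behavior seen above could be implemented as a discovery-client version check at operator startup, along these lines:

```go
package sidecar

import (
	"fmt"
	"strconv"
	"strings"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// validateSidecarSubmitterRestart returns an error (so the operator can exit
// immediately) when the SidecarSubmitterRestart feature gate is enabled but
// the API server is older than 1.34.
func validateSidecarSubmitterRestart(cfg *rest.Config, gateEnabled bool) error {
	if !gateEnabled {
		return nil
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return err
	}
	v, err := dc.ServerVersion()
	if err != nil {
		return err
	}
	// Minor can carry a "+" suffix on some providers (e.g. "34+").
	major, _ := strconv.Atoi(strings.TrimSuffix(v.Major, "+"))
	minor, _ := strconv.Atoi(strings.TrimSuffix(v.Minor, "+"))
	if major > 1 || (major == 1 && minor >= 34) {
		return nil
	}
	return fmt.Errorf("SidecarSubmitterRestart feature gate requires K8s 1.34+, server reports %s.%s", v.Major, v.Minor)
}
```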

kind v1.34+ / feature gate enabled

  1. Create a cluster with a v1.34+ node image and ContainerRestartRules enabled (v1.35 enables it by default):
kind create cluster --name test-cluster --image kindest/node:v1.35.0
kind load docker-image kuberay/operator:nightly --name test-cluster
  2. Enable the feature gate for the KubeRay operator:
helm upgrade --install kuberay-operator ../helm-chart/kuberay-operator \
  --set image.repository=kuberay/operator \
  --set image.tag=nightly \
  --set featureGates\[0\].name=SidecarSubmitterRestart \
  --set featureGates\[0\].enabled=true

  3. Apply the same manifest pasted above.

  4. Record the job ID:

JOB_ID=$(kubectl get rayjob rayjob-sidecar-mode -o jsonpath='{.status.jobId}')
  5. Disrupt the sidecar container:
CLUSTER=$(kubectl get rayjob rayjob-sidecar-mode -o jsonpath='{.status.rayClusterName}')
HEAD_POD=$(kubectl get pods -l ray.io/cluster=$CLUSTER,ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}')

CONTAINER_ID=$(kubectl get pod $HEAD_POD -o jsonpath='{.status.containerStatuses[?(@.name=="ray-job-submitter")].containerID}' | sed 's|containerd://||')
docker exec -it test-cluster-control-plane crictl stop $CONTAINER_ID
  6. Verify the RayJob does not fail:
kubectl get rayjob rayjob-sidecar-mode -o jsonpath='{.status.jobDeploymentStatus}'
  7. Verify the submitter container actually restarted:
kubectl get pod $HEAD_POD -o jsonpath='{range .status.containerStatuses[*]}{.name}{" restartCount="}{.restartCount}{"\n"}{end}'

# The restartCount should increase by 1
  8. Verify the Ray job is still running with the same job ID:
kubectl exec $HEAD_POD -c ray-head -- ray job status --address=http://127.0.0.1:8265 "$JOB_ID"

Checks

  • Testing Strategy
    • Unit tests
    • Manual tests

@justinyeh1995 justinyeh1995 changed the title [WIP][Fix][RayJob SidecarMode] prevent premature job termination during transient head node spikes [WIP][Fix][RayJob SidecarMode] Prevent premature job termination during transient head node spikes Jan 15, 2026
Future-Outlier (Member) left a comment:


  1. We should make sure this is only used when the K8s version is >= 1.34.
  2. We should also have some mechanism to check that the user has enabled the alpha feature via the feature gate.

justinyeh1995 (Contributor, author) replied:

  1. We should make sure this is only used when the K8s version is >= 1.34.
  2. We should also have some mechanism to check that the user has enabled the alpha feature via the feature gate.

Appreciate the concrete suggestions.

I am wondering whether we should keep the current fix (checking the dashboard before applying the timeout) as a fallback for users on K8s < 1.34.

Or should we simply focus on solving it for K8s >= 1.34? My judgement is that the two can coexist.

justinyeh1995 (Contributor, author) commented:

After discussing offline with @Future-Outlier, this PR will instead focus on implementing submitter restarts for K8s 1.34+.

@justinyeh1995 justinyeh1995 changed the title [WIP][Fix][RayJob SidecarMode] Prevent premature job termination during transient head node spikes [Fix][RayJob SidecarMode] Prevent premature job termination during transient head node spikes= Jan 24, 2026
@justinyeh1995 justinyeh1995 changed the title [Fix][RayJob SidecarMode] Prevent premature job termination during transient head node spikes= [Fix][RayJob SidecarMode] Prevent premature job termination during transient head node spikes Jan 28, 2026
Comment on lines +629 to +630
Operator: corev1.ContainerRestartRuleOnExitCodesOpNotIn,
Values: []int32{0},
justinyeh1995 (Contributor, author) commented:


Is it ok if we restart on any non-zero exit code?

@justinyeh1995 justinyeh1995 marked this pull request as ready for review January 31, 2026 08:16
cursor (bot) left a comment:


Cursor Bugbot has reviewed your changes and found 1 potential issue.

@justinyeh1995 justinyeh1995 marked this pull request as draft February 4, 2026 10:39
@justinyeh1995 justinyeh1995 marked this pull request as ready for review February 4, 2026 10:40
Copilot AI mentioned this pull request Feb 5, 2026
@CheyuWu CheyuWu self-requested a review February 6, 2026 17:32
justinyeh1995 (Contributor, author) commented:

cc @seanlaii @troychiu to review if you get a chance, thanks a lot!


Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Submission mode SidecarMode causes job termination when head node CPU spikes

2 participants